{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Metadata Replacement + Node Sentence Window\n",
    "\n",
    "In this notebook, we use the `SentenceWindowNodeParser` to parse documents into single sentences per node. Each node also contains a \"window\" with the sentences on either side of the node sentence.\n",
    "\n",
    "Then, during retrieval, before passing the retrieved sentences to the LLM, the single sentences are replaced with a window containing the surrounding sentences using the `MetadataReplacementNodePostProcessor`.\n",
    "\n",
    "This is most useful for large documents/indexes, as it helps to retrieve more fine-grained details.\n",
    "\n",
    "By default, the sentence window is 5 sentences on either side of the original sentence.\n",
    "\n",
    "In this case, chunk size settings are not used, in favor of following the window settings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import openai\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n",
    "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "from llama_index import ServiceContext, set_global_service_context\n",
    "from llama_index.llms import OpenAI\n",
    "from llama_index.embeddings import OpenAIEmbedding\n",
    "from llama_index.node_parser import SentenceWindowNodeParser\n",
    "\n",
    "# create the sentence window node parser w/ default settings\n",
    "node_parser = SentenceWindowNodeParser.from_defaults(\n",
    "    window_size=3,\n",
    "    window_metadata_key=\"window\",\n",
    "    original_text_metadata_key=\"original_text\",\n",
    ")\n",
    "\n",
    "llm = OpenAI(model=\"gpt-3.5-turbo\", temperature=0.1)\n",
    "ctx = ServiceContext.from_defaults(\n",
    "    llm=llm,\n",
    "    embed_model=HuggingFaceEmbeddings(\n",
    "        model_name=\"sentence-transformers/all-mpnet-base-v2\"\n",
    "    ),\n",
    "    node_parser=node_parser,\n",
    ")\n",
    "\n",
    "# if you wanted to use OpenAIEmbeddings, we should also increase the batch size,\n",
    "# since it involves many more calls to the API\n",
    "# ctx = ServiceContext.from_defaults(llm=llm, embed_model=OpenAIEmbedding(embed_batch_size=50)), node_parser=node_parser)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build the index\n",
    "\n",
    "Here, we build an index using chapter 3 of the recent IPCC climate report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 20.7M  100 20.7M    0     0  20.7M      0 --:--:-- --:--:--  0:00:02 8706k--:--:-- --:--:-- 20.8M\n"
     ]
    }
   ],
   "source": [
    "!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import SimpleDirectoryReader\n",
    "\n",
    "documents = SimpleDirectoryReader(\n",
    "    input_files=[\"./IPCC_AR6_WGII_Chapter03.pdf\"]\n",
    ").load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import VectorStoreIndex\n",
    "\n",
    "sentence_index = VectorStoreIndex.from_documents(documents, service_context=ctx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying\n",
    "\n",
    "### With MetadataReplacementPostProcessor\n",
    "\n",
    "Here, we now use the `MetadataReplacementPostProcessor` to replace the sentence in each node with it's surrounding context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There is low confidence in the quantification of AMOC changes in the 20th century due to low agreement in quantitative reconstructed and simulated trends. Additionally, direct observational records since the mid-2000s remain too short to determine the relative contributions of internal variability, natural forcing, and anthropogenic forcing to AMOC change. However, it is very likely that AMOC will decline over the 21st century for all SSP scenarios, but there will not be an abrupt collapse before 2100.\n"
     ]
    }
   ],
   "source": [
    "from llama_index.indices.postprocessor import MetadataReplacementPostProcessor\n",
    "\n",
    "query_engine = sentence_index.as_query_engine(\n",
    "    similarity_top_k=2,\n",
    "    # the target key defaults to `window` to match the node_parser's default\n",
    "    node_postprocessors=[\n",
    "        MetadataReplacementPostProcessor(target_metadata_key=\"window\")\n",
    "    ],\n",
    ")\n",
    "window_response = query_engine.query(\"What are the concerns surrounding the AMOC?\")\n",
    "print(window_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Window: Nevertheless, projected future annual cumulative upwelling wind \n",
      "changes at most locations and seasons remain within ±10–20% of \n",
      "present-day values (medium confidence) (WGI AR6 Section  9.2.3.5; \n",
      "Fox-Kemper et al., 2021). Continuous observation of the Atlantic meridional overturning \n",
      "circulation (AMOC) has improved the understanding of its variability \n",
      "(Frajka-Williams et  al., 2019), but there is low confidence in the \n",
      "quantification of AMOC changes in the 20th century because of low \n",
      "agreement in quantitative reconstructed and simulated trends (WGI \n",
      "AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 2021). Direct observational records since the mid-2000s remain too short to \n",
      "determine the relative contributions of internal variability, natural \n",
      "forcing and anthropogenic forcing to AMOC change (high confidence) \n",
      "(WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., \n",
      "2021). Over the 21st century, AMOC will very likely decline for all SSP \n",
      "scenarios but will not involve an abrupt collapse before 2100 (WGI \n",
      "AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021). 3.2.2.4 Sea Ice Changes\n",
      "Sea ice is a key driver of polar marine life, hosting unique ecosystems \n",
      "and affecting diverse marine organisms and food webs through its \n",
      "impact on light penetration and supplies of nutrients and organic \n",
      "matter (Arrigo, 2014). Since the late 1970s, Arctic sea ice area has \n",
      "decreased for all months, with an estimated decrease of 2 million km2 \n",
      "(or 25%) for summer sea ice (averaged for August, September and \n",
      "October) in 2010–2019 as compared with 1979–1988 (WGI AR6 \n",
      "Section 9.3.1.1; Fox-Kemper et al., 2021).\n",
      "------------------\n",
      "Original Sentence: Over the 21st century, AMOC will very likely decline for all SSP \n",
      "scenarios but will not involve an abrupt collapse before 2100 (WGI \n",
      "AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).\n"
     ]
    }
   ],
   "source": [
    "window = window_response.source_nodes[0].node.metadata[\"window\"]\n",
    "sentence = window_response.source_nodes[0].node.metadata[\"original_text\"]\n",
    "\n",
    "print(f\"Window: {window}\")\n",
    "print(\"------------------\")\n",
    "print(f\"Original Sentence: {sentence}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Contrast with normal VectorStoreIndex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "from llama_index import VectorStoreIndex\n",
    "from llama_index import ServiceContext, set_global_service_context\n",
    "from llama_index.llms import OpenAI\n",
    "\n",
    "llm = OpenAI(model=\"gpt-3.5-turbo\", temperature=0.1)\n",
    "ctx = ServiceContext.from_defaults(\n",
    "    llm=llm,\n",
    "    embed_model=HuggingFaceEmbeddings(\n",
    "        model_name=\"sentence-transformers/all-mpnet-base-v2\"\n",
    "    ),\n",
    ")\n",
    "\n",
    "vector_index = VectorStoreIndex.from_documents(documents, service_context=ctx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I'm sorry, but the concerns surrounding the AMOC (Atlantic Meridional Overturning Circulation) are not mentioned in the provided context.\n"
     ]
    }
   ],
   "source": [
    "query_engine = vector_index.as_query_engine(similarity_top_k=2)\n",
    "vector_response = query_engine.query(\"What are the concerns surrounding the AMOC?\")\n",
    "print(vector_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, that didn't work. Let's bump up the top k! This will be slower and use more tokens compared to the sentence window index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The context information does not provide any specific concerns surrounding the AMOC (Atlantic Meridional Overturning Circulation).\n"
     ]
    }
   ],
   "source": [
    "query_engine = vector_index.as_query_engine(similarity_top_k=5)\n",
    "vector_response = query_engine.query(\"What are the concerns surrounding the AMOC?\")\n",
    "print(vector_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis\n",
    "\n",
    "So the `SentenceWindowNodeParser` + `MetadataReplacementNodePostProcessor` combo is the clear winner here. But why?\n",
    "\n",
    "Embeddings at a sentence level seem to capture more fine-grained details, like the word `AMOC`.\n",
    "\n",
    "We can also compare the retrieved chunks for each index!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Over the 21st century, AMOC will very likely decline for all SSP \n",
      "scenarios but will not involve an abrupt collapse before 2100 (WGI \n",
      "AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).\n",
      "--------\n",
      "Direct observational records since the mid-2000s remain too short to \n",
      "determine the relative contributions of internal variability, natural \n",
      "forcing and anthropogenic forcing to AMOC change (high confidence) \n",
      "(WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., \n",
      "2021).\n",
      "--------\n"
     ]
    }
   ],
   "source": [
    "for source_node in window_response.source_nodes:\n",
    "    print(source_node.node.metadata[\"original_text\"])\n",
    "    print(\"--------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we can see that the sentence window index easily retrieved two nodes that talk about AMOC. Remember, the embeddings are based purely on the original sentence here, but the LLM actually ends up reading the surrounding context as well!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's try and disect why the naive vector index failed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AMOC mentioned? False\n",
      "--------\n",
      "AMOC mentioned? False\n",
      "--------\n",
      "AMOC mentioned? True\n",
      "--------\n",
      "AMOC mentioned? False\n",
      "--------\n",
      "AMOC mentioned? False\n",
      "--------\n"
     ]
    }
   ],
   "source": [
    "for node in vector_response.source_nodes:\n",
    "    print(\"AMOC mentioned?\", \"AMOC\" in node.node.text)\n",
    "    print(\"--------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So source node at index [2] mentions AMOC, but what did this text actually look like?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nevertheless, projected future annual cumulative upwelling wind \n",
      "changes at most locations and seasons remain within ±10–20% of \n",
      "present-day values (medium confidence) (WGI AR6 Section  9.2.3.5; \n",
      "Fox-Kemper et al., 2021).Continuous observation of the Atlantic meridional overturning \n",
      "circulation (AMOC) has improved the understanding of its variability \n",
      "(Frajka-Williams et  al., 2019), but there is low confidence in the \n",
      "quantification of AMOC changes in the 20th century because of low \n",
      "agreement in quantitative reconstructed and simulated trends (WGI \n",
      "AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 2021).Direct observational records since the mid-2000s remain too short to \n",
      "determine the relative contributions of internal variability, natural \n",
      "forcing and anthropogenic forcing to AMOC change (high confidence) \n",
      "(WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., \n",
      "2021).Over the 21st century, AMOC will very likely decline for all SSP \n",
      "scenarios but will not involve an abrupt collapse before 2100 (WGI \n",
      "AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).3.2.2.4 Sea Ice Changes\n",
      "Sea ice is a key driver of polar marine life, hosting unique ecosystems \n",
      "and affecting diverse marine organisms and food webs through its \n",
      "impact on light penetration and supplies of nutrients and organic \n",
      "matter (Arrigo, 2014).Since the late 1970s, Arctic sea ice area has \n",
      "decreased for all months, with an estimated decrease of 2 million km2 \n",
      "(or 25%) for summer sea ice (averaged for August, September and \n",
      "October) in 2010–2019 as compared with 1979–1988 (WGI AR6 \n",
      "Section 9.3.1.1; Fox-Kemper et al., 2021).For Antarctic sea ice there is \n",
      "no significant global trend in satellite-observed sea ice area from 1979 \n",
      "to 2020 in either winter or summer, due to regionally opposing trends \n",
      "and large internal variability (WGI AR6 Section 9.3.2.1; Maksym, 2019; \n",
      "Fox-Kemper et al., 2021).CMIP6 simulations project that the Arctic Ocean will likely become \n",
      "practically sea ice free (area below 1 million km2) for the first time before \n",
      "2050 and in the seasonal sea ice minimum in each of the four emission \n",
      "scenarios SSP1-1.9, SSP1-2.6, SSP2-4.5 and SSP5-8.5 (Figure 3.7; WGI \n",
      "AR6 Section 9.3.2.2; Notze and SIMIP Community, 2020; Fox-Kemper \n",
      "et al., 2021).Antarctic sea ice area is also projected to decrease during \n",
      "the 21st century, but due to mismatches between model simulations \n",
      "and observations, combined with a lack of understanding of reasons \n",
      "for substantial inter-model spread, there is low confidence in model \n",
      "projections of future Antarctic sea ice changes, particularly at the \n",
      "regional level (WGI AR6 Section  9.3.2.2; Roach et  al., 2020; Fox-\n",
      "Kemper et al., 2021).3.2.3 Chemical Changes\n",
      "3.2.3.1  Ocean Acidification\n",
      "The ocean’s uptake of anthropogenic carbon affects its chemistry \n",
      "in a process referred to as ocean acidification, which increases the \n",
      "concentrations of aqueous CO 2, bicarbonate and hydrogen ions, and \n",
      "decreases pH, carbonate ion concentrations and calcium carbonate \n",
      "mineral saturation states (Doney et  al., 2009).Ocean acidification\n"
     ]
    }
   ],
   "source": [
    "print(vector_response.source_nodes[2].node.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So AMOC is disuccsed, but sadly it is in the middle chunk. With LLMs, it is often observed that text in the middle of retrieved context is often ignored or less useful. A recent paper [\"Lost in the Middle\" discusses this here](https://arxiv.org/abs/2307.03172)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
